Always avoid using UDFs when you can use Spark built-in functions. You can rewrite your logic with the `when` function like this:

```python
from pyspark.sql import functions as F

def get_include_col():
    c = F.when(
        (F.col("curr_year") == F.col("start_year")) & (F.col("curr_month") >= F.col("start_month")),
        F.lit(1)
    ).when(
        (F.col("curr_year") == F.col("end_year")) & (F.col("curr_month") <= F.col("end_month")),
        F.lit(1)
    ).when(
        (F.col("curr_year") > F.col("start_year")) & (F.col("curr_year") < F.col("end_year")),
        F.lit(1)
    ).otherwise(F.lit(0))
    return c

temp = temp.withColumn('include', get_include_col())
```

You can also use `functools.reduce` to generate the `when` expressions dynamically, without having to type all of them out. For example:

```python
import functools
from pyspark.sql import functions as F

cases = [
    ("curr_year = start_year and curr_month >= start_month", 1),
    ("curr_year = end_year and curr_month <= end_month", 1),
    ("curr_year > start_year and curr_year < end_year", 1)
]

# Fold the (condition, value) pairs into one chained when-expression,
# starting the reduction from the functions module F itself.
include_col = functools.reduce(
    lambda acc, x: acc.when(F.expr(x[0]), F.lit(x[1])),
    cases,
    F
).otherwise(F.lit(0))

temp = temp.withColumn('include', include_col)
```
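To see what the chained `when` conditions compute, here is a minimal pure-Python sketch of the same inclusion logic (no Spark session needed). The function name `include` is hypothetical; the parameters mirror the DataFrame columns above. Assuming `start_year < end_year`, the three cases together are equivalent to a simple tuple comparison on (year, month):

```python
# Pure-Python sketch of the logic encoded by the when-chain above.
# Returns 1 when (curr_year, curr_month) falls inside the
# [(start_year, start_month), (end_year, end_month)] range, else 0.
def include(curr_year, curr_month, start_year, start_month, end_year, end_month):
    # Case 1: same year as the start -> month must be at or after start_month.
    if curr_year == start_year and curr_month >= start_month:
        return 1
    # Case 2: same year as the end -> month must be at or before end_month.
    if curr_year == end_year and curr_month <= end_month:
        return 1
    # Case 3: strictly between start and end years -> always included.
    if start_year < curr_year < end_year:
        return 1
    return 0
```

When `start_year < end_year`, this is the same as checking `(start_year, start_month) <= (curr_year, curr_month) <= (end_year, end_month)`. Note that when `start_year == end_year`, case 1 can fire even for months past `end_month`, which is a quirk of the original condition order to be aware of.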